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C^ ■ Summary: This article proposes a novel approach to statistical alignment of nucleotide sequences by introducing a context 

dependent structure on the substitution process in the underlying evolutionary model. We propose to estimate alignments 

and context dependent mutation rates relying on the observation of two homologous sequences. The procedure is based on a 

generalized pair-hidden Markov structure, where conditional on the alignment path, the nucleotide sequences follow a Markov 

distribution. We use a stochastic approximation expectation maximization (saem) algorithm to give accurate estimators of 

^ ' parameters and alignments. We provide results both on simulated data and vertebrate genomes, which are known to have a 

QQ ' high mutation rate from CG dinucleotide. In particular, we establish that the method improves the accuracy of the alignment 

O^ , of a human pseudogene and its functional gene. 
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1. Introduction 

Alignment of DNA sequences is the core process in comparative genomics. An alignment is a mapping of the nucleotides in one 
. , sequence onto the nucleotides in the other sequence which captures two important evolutionary processes: the substitution (of 

rrt ' a nucleotide, by another one) and the insertion or deletion (indel, of one or several nucleotides). Thus, in this mapping, two 

matched nucleotides are supposed to derive from a common ancestor (they are homologous) and the mapping also allows for 
gaps which may be introduced into one or the other sequence. Several powerful alignment algorithms have been developed to 
align two or more sequences, a vast majority of them relying on dynamic programming methods that optimize an alignment 
score function. While score-based alignment procedures appear to be very rapid (an important point regarding to the current 
sizes of the databases), they have at least two major flaws: i) The choice of the score parameters requires some biological 
expertise and is non objective. This is particularly problematic since the resulting alignments are not robust to this choice 
(jDewev et al.l . |2006| ). ii) The highest scoring alignment might not be the most relevant from a biological point of view. One 
should then examine the fc-highest scoring alignments (with the additional issue o f selecting a value for k) but this approach 
often proves imprac tical because o f the sheer number of suboptimal alignments () Waterman and Eggertl . 119871 1. We refer to 
Chapters 2 and 4 in lDurbin et al.l (|l998l l for an introduction to sequence alignment. 

Contrarily to score-based alignment procedures, statistical (or probabilistic) alignment methods rely on an explicit modeling 
of the two evolutionary processes at the core of an alignment, namely the substitution and the indel processes. While there is 
a general agreement that substitution events are satisfactorily modeled relying on continuous time Markov processes on the 
nucleotide state space, very few indel models have been proposed in the literature and no ne of them has becom e a standard. 
The first evolutionary model for biological sequences dealing with indels was formulated bv lThorne et all (|l99ll ). hereafter the 
TKF91 model. It provides a basis for performing alignment within a statistical framework: the optimal alignment is obtained 
as the evolutionary path that maximizes the likelihood of observing the sequences as being derived from a common ancestor 
(under this model). This has at least two major advantages over the score-based method. First, in this context, the evolutionary 
parameters (which play the role of an underlying scoring function) are not chosen by the user but rather estimated from the 
data, in a maximum likelihood framework. Second, a posterior probability is obtained on the set of alignments: the analysis 
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does not concentrate on the highest probable path and rather provides sample alignments from this posterior distribution. 
Moreover, a confidence measure may be assigned to each position i n the alignment. T his is particularly relevant as alignment 
uncertainty is known to be a major flaw in comparative genomics (|Wong et al.l . 120081 1. 

St atistical alignment is performed under the convenient pair- hidden Markov model, hereafter pair-HMM (see Arribas-Gil et al.l . 



2006l : lDurbin et al.l . ll998l ). enabling the use of generalizations of the expectation-maximization (em) algorithm (|Dempster et al 



19771 ). Some drawbacks of the pre liminary TKF91 pro posal have first been improved by the same authors in what is called 



the TKF92 ve rsion of the model (IThorne et al.l. 19921). Then, the origi n al models have been later refi n ed in many ways, a s 



for instance in I Arribas-Gil et all (|2009t ): lKnudsen and Mivamotd (|2003t ): iMet^ (|2003l '): iMiklosI (|2003h : iMiklos et al.1 (12004 ): 



iMiklos and Toroczkail ( 2001f ). We underline that an important characteristic of all those models is that the i ndel and the 



substi tution processes are handled independently. An introduction to statistical alignment may be found in iLunter et al.l 

(Eooi). 

It is important to note that the original TKF91 model and all the later developments are primarily focused on modeling 
the indel process. As for the under lying substitution mode l, pair-HMMs procedures mostly rely on the simplest one: the 
one-parameter Jukes-Cantor model (jJukes and Cantoij . Il969l ). Surprisingly, while models of substi tution processes h ave made 
important progress towards a more accurate description of reality in the past years (see for instance lFelsensteirJ . |2003| . Chapter 
13), these developments have had almost no impact in the statistical alignment literature. One of the reasons for that may 
be the existence of important biases found in sequence alignment and resulting from incorrectly inferred indels. For instance, 
iLunter et al.l (|2007|) exhibit three major classes of such biases: (1) gap wander, resulting from the incorrect placement of gaps 
due to spurious non homologous similarity; (2) gap attraction, resulting in the joining of two closely positioned gaps into 
one larger gap; and (3) gap annihilati on, resulting in the deletio n of two indels of equal size for a typically more favorable 
representation as substitutions (see also lHolmes and Durbinlll998l ). This might explain a stronger focus on accurate estimation 
of indels positions. Another reason could be that since evolutionary parameters are to be estimated in statistical alignments 
procedures, their number should remain quite low. 

On the contrary, score-based alignment methods offer some variety in the form of the match scoring functions, which is 
the counterpart of the substitution process parameters when relying on statistical alignment. Note that in this context, using 
different types of indels scores is hardly possible, since the dynamic programming approach limits those scores to affine (in 
the gap length) penalties. Nonetheless, the match scores may take different forms. One of the most recent improvement s 
in score-based alignment proced ures conc erns the deve lopment of context-based scoring schemes as in lCambin et al.l (120061 1: 
iGambin and Woitalewica (|2007l ) (see also iHuand . Il994l . as a much earlier attempt). In these works, the cost of a substitution 
depends on the surrounding nucleotides. As a consequence, the score of an alignment depends on the order of editing operations. 
Another drawback of this approach lies in the issue of computing a p-value associated to the alignment significance. Indeed, 
in order to assess the statistical significance of an alignment, one requires the knowledge of the tail distribution of the 
score statistic, under the null hypothesis of non homology between the sequences. While this distribution is not even fully 
characterized in the simpler case of site independent scoring schemes allowing gaps, it is currently out of reach when introducing 
dependencies in the substitution cost. 

Nonetheless, allowing for context dependence in the modeling of the evolutionary processes is important as it captures 
local sequence dependent fluctuations in substitution rates. It is well-known for instance that, at least in vertebrates, there 
is a higher level of methylated cytosine in CG dinucleotides (denoted CpG). This methylated cytosine mutates abnormally 
frequently to thymine which results in higher substitution rates form the pair CpG dBulme r. 1986). At the mutationa l level, 
context dependent substitution rates can have a significant impact on prediction accuracy (|Siepel and Haussleij . |200J) . 

Note that allowing for context dependent indel rates is also a challenging issue. For instance, microsatellite expansion and 
contraction is known to be a context dependent process (see for instance some recent results in IVarela and Amoj. 20091) and 



suggest s the use of context dependent indel rates. Such an attempt has been very recently made by iHickev and Blanchettd 
(|2011al lbl). relying on tree adjoining grammars. However, introducing locally dependent indel rates in an evolutionary process 
implies that the substitution and the indel processes may not anymore be handled separately. This independence between the 
two processes is a key assumption in limiting the computational cost of pair-HMMs and removing this hypothesis is beyond 
the scope of the present work. Thus, we limit our approach to modeling context dependent substitution rates. 

In this work, we formulate a context dependent pair hidden Markov model, so as to perform statistical alignment of 
two sequences allowing for context dependent substitution rates. We then propose a method to perform parameter max- 
imum likelihood esti mation as well as posterior sam pling within this model. Our procedure relies on stochastic versions 
(ICeleux and Dieboltl. 198 5: Dic bolt and Celeuxl . Il993l ) of the em algorithm. A similar method has already been successfully 
used in lArribas-Gil et al.l (|2009l ) in a classical pair-HM M (with no con text dependent substitution rates) in which case it 
gives results very similar to Gibbs sampling. Note that iHolmea (|2005l ) also suggests the use of a stochastic version of em 
for estimating indel rates in the multiple alignment framework, assuming the alignment is unknown. In the same spirit, 
iHobolthI (|2008l ) proposes an MCMC-EM algorithm for the estimation of neighbor-dependent substitution rates from a multiple 
alignment and a phylogenetic tree. The advantages of stochastic versions of em algorithm is that they allow to perform 
maximum likelihood estimation in a hidden variables framework where the maximization is not explicit, the computation 
of the likelihood is computati onally very expensive, and wh ere the use of the classical em algorithm is not feasible. To our 
knowledge, this work is, with iHickev and Blanchettd (|2011al )'s approach, one of the first attempts to perform a contextual 
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statistical alignment. 



This article is organized as follows. In Section [21 we first describe a context dependent pair hidden Markov model. In this 
framework, parameter maximum likelihood estimation combined with hidden states posterior sampling provides (a posterior 
probability on) contextual alignments of two sequences which is data-driven in the sense that it does not require any parameter 
choice. Section[3]briefly discusses asymptotic consistency results that can be obtained on the parameter estimation procedure. 
Then, Section [4] describes two versions of the em algorithm applied in this framework of contextual statistical alignment: the 
stoch astic expectation maximization (sem) and the stochastic approximation expectation maximization (saem) (JDelvon et al.l . 
I1999I ) algorithms. The performance of these procedures are illustrated on synthetic data in Section \E\ and a real data set is 
handled in Section [6] 



2. Model 

A pair-hidden Markov model is described by the distribution of a non-observed increasing path through the two-dimensional 
integer lattice N x N and given this hidden path, the conditional distribution of a pair of observed sequences Xi-.n := 
Xi, . . . , X„ G A^ and Yi-.m '■= Vi, . . . , Ym G A"^, with values on a finite set A (the nucleotides alphabet). The path reflects 
the insertion-deletion process acting on the sequences and corresponds to the bare alignment, namely the alignment without 
specification of the nucleotides. It is the sum of random moves from the set £ = {(1, 1), (1, 0), (0, 1)} also denoted {M, Ix, 1y}, 
where M stands for 'match', Ix stands for 'insert in X' and similarly ly stands for 'insert in Y'. This path represents an 
empty alignment of the sequences, as the move A'l represents an homologous site between the two sequences, the move Ix 
represents either an insertion in the first sequence {Xi;„) or a deletion in the second (Yi-.m) and similarly for the move ly (see 
Figure [T]) . 
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Figure 1. Graphical representation of an alignment between two sequences X = AATG and Y = CTGG. The displayed 
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alignment is c - tg G . 

We assume that the hidden path follows a Markov distribution without sp ecifying from which insertion-deletion process it 
comes from. This could be fo r instance th e TKF91 insertion-deletion process (Thornc ct al., I99I) o r any of its later variants 
(|Arribas-Gil et al.1 . [2OO9I : JKnu dscn and M ivamotd . [2003: Motzlcr, 2003; Miklos, 2003; Thornc ct al.,. Il992l l. Given the empty 
alignment, we then describe the conditional distribution of the sequence letters. In a 'match' position, this distribution reflects 
the substitution process while in 'insert in X' or 'insert in Y' positions, this distribution reflects both the substitution process 
and the distribution of inserted segments. 

More precisely, let {et}t^o be an ergodic, homogeneous and stationary Markov chain with state space £ — {(1, 1), (1, 0), (0, 1)} 
and transition probability tt (a stochastic matrix of size 3x3). For notational convenience, the state space £ is sometimes 
equivalently denoted by {AI, Ix, Iy}- In this context, we might use obvious notations as for instance timm for the conditional 
probability r{et = (1, l)|et-i = (1, 1)). 

Let Zt = {Nt, Alt) ~ X]!=i £s be the path induced by this Markov chain through the lattice N x N. In the following, a path 
through the lattice N x N is always assumed to be increasing, namely of the form ^^ e^ for some sequence e G f'*. We denote 
by P^ the corresponding stationary probability and by £n,m the set of paths in N x N starting at (0, 0) and ending at (n, m). 
For any e G £n,m, we also let |e| denote the length of e, which satisfies n\/ m ^ \e\ ^ n + m. 

The hidden process generates the observed sequences in the following way. Given the hidden path {et}t^o, the sequences 
{Xi;n, Yi;m} are generated according to an order one Markov chain: for any e G £n,m, 



P(Xi:„, Fi.„|e = e) = n IP(^JV, , Ym, le^, e,_i , Xn,_, , Ym,_,). 



^{Xms 1 Yms \^a, Gs-l, -^JV3_i , Y^■ 
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Note that the indexes Ns , Ms of the observations generated at ti me s are random. This is in sharp contrast with the classical 
hidden Markov model (HMM) and is specific to pair-HMMs (see lArribas-Gil et alj . |2006| . for more details) . 

In this work, we restrict our attention to order-one Markov conditional distributions. Note that the following developments 
could be generalized to higher order Markov chains at the cost of increased computational burden. 

As we want to model the dependency in the substitution process only, we further constrain the model so that the dependency 
occurs only in successive homologous (match) positions. We thus use the following parametrization 

h(XN^,YMjXN^_i,YM,_i) if Cs = es-i = (1,1), 

h(XN,,YM^) ife, = (l,l)/e,_i, 

/(^ivj if e. = (1,0), 

I g{YM,) if e. = (0,1), 

where /, g are probability measures (p.m.) on A; function /i is a p.m. on A^ and h may be viewed as a stochastic matrix on 
A^. 

Obvious necessary conditions for the parameters to be identifiable in this model are the following 

fi) 3a, b £ A such that h{a, b) ^ f{a)g{b), 
ii) 3a, 6, c, d G ^ such that /i(a, b|c, d) 7^ /(a)3(6), (1) 

Hi) 3a, b,c,d £ A such that h{a, &|c, d) ^ h(a, b). 

The whole parameter set is then given by 

e = {e = (tt, /, g, h,h);e satisfies ©}. 

St atistical alignment of two sequences consists in maximizing with respect to 6 the following criterion w„^rn{6) (|Durbin et al.l . 
Il998l ). This criterion plays the r ole of a log-likelihood in the pair-HMM. (For a discussion on the quantities playing the role 
of likelihoods in pair-HMMs, see I Arribas-Gil et al.l . l200(i ). For any integers n,m ^ 1 and any observations Xi:„ and Yi-.m, let 

w„,mie) ■-logQg{Xi:„,Yi..m) := logPe(3s^ l,Zs = {n,m);Xi:n,Yi:m). (2) 

The criterion Wn,m is more explicitly defined as 

w^MO) = log ( E E P-(^- - «-) X { n /(^nJ^<='=-'^''">5(F„.J^^^^=(°'^'> 

s — nVm ee^Ti. in jle|—s A; — 1 

where (71^,771^) = X]t=i ^'- Now, we define the parameter estimator as 

On,m = argmaxTO„,m(6l). 

In the classical pair-HMM (namely without accounting for de p enden cies in the substitution process), consistency results for 
the above estimator 6n,m were obtained in lArribas-Gil et al.l (|2006r ) and are extended here to the context dependent case 
in Section [31 Then in Section 3] we develop an saem algorithm to compute 6n,m and provide an a posteriori probability 
distribution over the set of alignments. Indeed, a main issue is to obtain an ac curate alignment of the sequences. Score-based 
alignment methods heavily rely on the Viterbi algorithm for this purpose (see lDurbin et al.l . 119981 ). which provides the most 
probable a posteriori alignment. However, optimal alignments often look different from typical ones, and providing a unique 
alignment as the result of the estimation procedure may not be very informative. For this reason, it may be more interesting 
to provide the probability distribution of hidden states over each pair of nucleotides from the observed sequences (for the 
parameter value dn,m)- T his can be done under th e pair-HMM framework and gives us a reliability measure of any alignment 
of the two sequences (see lArribas-Gil et al.l . [2009|) . 



3. Consistency results for parameter estimation 

In this part, we give generalizations of results first obtained in I Arribas-Gil et al.l (|2006l ). to the context dependent pair-HMM. 
As the proofs follow the same lines as in this reference, we shall omit them. We first introduce some notations and formulate 
some assumptions that will be needed. For any 5 > 0, we let 

05 = {0 e O-yk, Ok > 5} and 60 = {6» G 6; Vfc, 9k >Q}. 

The true parameter value will be denoted by 9o and is assumed to belong to the set So. Probabilities and expectations under 
this parameter value are denoted by Po and Eo, respectively. We now introduce a notation for the marginal distributions of 
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h and h. For any b,c £ A, let 

hx(-) ■- Y^ h{-,a), hyi-) ■- ^/i(a, •), 

hx{-\b, c) := Y^ h(-, a\b, c), hY{-\b, c) := ^ h{a,-\b, c). 
We then let Bmarg be a subset of parameters in 6o satisfying some assumptions on the marginals of h and h. 

Omarg = (^ ^ Oq; Vfc, C G ^, /iX = f , ky ^g,hx{-\b,c) = f {■) ,hY {■\b , c) ^ g{-)\ ■ 

Note that for instance, the parametrization used in Section [5] satisfies 6 £ Omarg- Finally, we introduce the process wt defined 
in a similar way as in ([2]) by 

■Wt{9) ■— logQe{Xl:Mt ,Yl:Mt) ■ 

As previously noted, the length t of an alignment of sequences with respective sizes n and m satisfies nVm^t^n + m. 
Thus, asymptotic results for t — >■ +oo will imply equivalent ones for n,m ^ +oo. In other words, consistency results obtained 
when t — >■ +0O are valid for long enough observed sequences, even if one does not know the length t of the true underlying 
alignment. We now state the main theorem. 

Theorem 1: For any 6 e 9o, we have 

i) Renormalized log-likelihood t~ 'Wt{9) converges ¥o-alniost surely and in Li, as f tends to infinity, to 

1 1 

w{6) = lim -Ea(logQo{Xi:Nt,Yi:MtJ) = sup -Eo (logQfl(Xi:]Vt , YiiMj)) . 

t—t-oc t t t 

ii) Moreover, 'w{6o) ^ 'w{6) and strict ineguality is valid as long as at least one of these conditions is satisfied 

A. 6 eOo andy\> 0,Ee(ei) ^ XEo{ei), 

B. 9q,9 £ Qmarg and either f / /o or g j^ go- 

Hi) The family of functions {t^^wt{6)}t^i is uniformly equicontinuous on the set Qg, for any 5 > 0. 

The t hree stateme nts in Theorem [T] are the key ingredients to establish t he consistency of 0„ ,m, following Wald's classical 
proof (jWaldl . ll949l ) of maximum likelihood estimators consistency (see also Ivan der Vaard . Il998l . Chapter 5). The most subtle 
part is point ii), that states some sufficient conditions under which the maximum of the limiting function w is attained only 
at the true parameter value 6o. For instance, condition A means that the main directions of the paths {Zt}t^i generated 
under the two parameter values 9 and 9o are different. As a consequence, these two parameters may be easily distinguished 
from each other relying on the distribu tions Pe and Pp. Note th at these two different cases (condition ^ or _B satisfied) are 
certainly not exhaustive and we refer to lArribas-Gil et al.l (|2006h for simulation results indicating that point ii) might be true 
under other scenarios. Note in particular that for evolutionary models assuming that insertions and deletions happen at the 
same rate, Ee(ei) is proportional to (1, 1) for any parameter value 9. In this case, condition A) is never satisfied and we are 
not able to prove that h ^ ho or h y^ ho implies strict inequality w{9o) > w{9). This is due to the dependency structure on 
the sequences, that make it difficult to link the difference w(9 o) — w(9) with a Kullba ck-Leibler divergence between densities 
h and ho or between h and ho (see the proof of Theorem 2 in lArribas-Gil et al.l . l2006l . for more details). 



In general, one is not interested in the full (context) pair-HMM parameters but rather in a sub-parametrization induced by 
the choice of a specific evolutionary model (see for instance the sub-parametrization proposed in Section[5]). Let /3 i-> 9{l3) be 
a continuous parametrization from some set B to O. For any 5 > 0, let Bg = 9~^{Qg). We assume that j3o :— 9~^{9o) belongs 
to Bs for some S > 0. We then let 

/3„,m := argmaxw„,m(6'(/3)). 

Besides from giving results on the frequentist estimator /3„,m,, we consider Bayesian estimates of the parameters. Let ;/ be a 
prior probability measure on the set Bg. Markov chain Monte Carlo (MCMC) algorithms approximate the random distribution 
v„^m, interpreted as the posterior measure given observations Xi:„ and Yi;m and defined by 

,,„. _ Q0{l3){Xl:n,Yl:m)l^{dl3) 

''"■"'^ ^' ' Is^Qoinix^:u,Y,..^)uidp')- 

With these definitions at hand, we can now state the following corollary. 

Corollary 1: // the maximizer of P ^- w{9{l3)) over the set Bg is reduced to the singleton {Po}, then we have the 
following results 

i) I3n,m converges f'o- almost- surely to Po, 
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ii) If V puts some weight on a neighborhood of /So, then the sequence of posterior measures Un,m. converges in distribution, 
Po-almost surely, to the Dime mass at /3o- 

4. Algorithms 

The approach we use here to maximize criterion Q and provide a post e rior distribution on the s e t of a lignments, given 
two observed sequences relies on a s tocha stic version (jCeleux and Diebold . 1 1983 : IPiebolt and Celeuxl.ll993|) of em algor ithm 
(|Baum et al.l . ll97d : lDempster et al.l ■ ll977^ ■ This approach has already been successfully used in I Arribas-Gil et alj ((20091) in a 
classical pair-HMM (meaning with no context dependent substitution rates) in which case it gives results very similar to Gibbs 
sampling. It allows to perform maximum likelihood estimation in a framework in which the maximization is not explicit, the 
computation of the likelihood is computationally very expensive, and in which the use of em algorithm is not feasible (see 
Appendix for details). 

4.1 Forward and backward equations 

We describe below the forward and backward pr obabilities used in the later procedures . The formulas are easily obtained by 
generalizing those from the classical pair-HMM (|Arribas-Gill . l2007l : lDurbin et al.l . Il998l '). 

The forward equations are obtained as follows. For any value u £ £, let Sn^m = {e £ fn.m, e|(.| — u} be the set of paths 
ending at (n, m) with last step being u. Then the forward probabilities are defined for any i ^ n and j ^ m, as 

a"(i,j)=P9(3s^ l,Zs = (i,j),£, =U,Xl:i,Yl:j)= J^ Pfl (s = 6, Al:, , Yl.j ) . 

These forward probabilities are computed recursively in the following way. We first set the initial values 

\/ue£, 0^(0,-1) =q"(-1,0) =0 
Vue {Ix,Iy}, a"(0,0) = 0anda'*^(0,0) = 1. 
Then for i = 0, . . . ,n and j = 0, . . . ,m except (i, j) = (0, 0), we recursively compute 

a^'{i,j) = ¥e{3s ^l,es = (1,1), Z^ = (i, j), Ai., Yi.j) ^J^^ei^s ^ l,e« = (1,1), ^s = ii,j),es-i = u, Xi..i,Yi:,) 

uee 

= h{Xi,Y,\X,.i,Yj.i)7VMMa'''ii - 1, j - 1) + hiX„Yj)-Ki^Ma'''ii - l,j - 1) + ft(A„ F,)7r,^Ma'^ (i - l,j - 1), (3) 

and in the same way 

a'^'ihj) = f{Xi)'^TVuixa''{i-l,j), 
a'^'ihj) = giyj)^T^niYa''{i,j ~l). 

uS£ 

Note that the log-likelihood of the observed sequences is then simply obtained from the forward probabilities by taking 

Wn,7ni0) = logP6)(3s ^ l,Zs = (n, m) , Ai;„ , Fi™) = log(a^^ (n, m) + a'^ {n,m) + a^'{n,m)). 

Note that the forward equations could thus be used to compute the log-likelihood of two observed sequences and a numerical 
optimization of this quantity could give maximum likelihood estimates. However, such a strategy fails short as soon as the 
dimension of the parameter space is not small. Indeed, the computational cost of calculating the log-likelihood for a given 
parameter value is 0{nm}, and thus the computational cost of its maximization over a grid is 0{p^nm), where k is the number 
of parameters and p the number of values considered for each parameter. 

We now describe the backward probabilities and their recursive computation. For any value u £ £, the backward probabilities 
are defined as 

r(i,i) =P9(A,+i:„,yj+i,„13s ^1,Z, = {iJ),es^u,X„Yj). 
We initialize these values by taking, for any u £ £, 

I3^{n, m) — 1 and l3^{n, tti -(- 1) = /3"(n + 1, rn) = 0, 
and recursively compute, for i = n, . . . , 1 and j — m, . . . ,1, except (n, m), the quantities 
r(i,i) = 7^uMl3"{i + l,j + l){/i(A,+i,F,+i)l„g{,^,,^} + ^(A,+i,F,+iiA.,F,)U=M} 

+ nui^l3'^ii + l,j)f(X,+i)+7Vui^l3'^ii,j + l)<7(y,+i). (4) 

Note that contrarily to the case of HMM, the forward-backward probabilities do not give rise to the classical em strategy, 
as we do not obtain from these equations the conditional expectation of the complete log-likelihood, given the two observed 
sequences (see Appendix for further details). 
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4.2 sem and saem algorithms 

The forward-backward probabilitie s may be used to simulate t he hidde n path, conditional o n the observed sequences. Thus, em 
algorithm may be replaced by sem (jCeleux and Dieboltl . |l985|) or saem (JDelvon et al.l . Il999l ) procedures. Indeed, let us explain 
the basic idea of these stochastic versions of em algorithm. Denoting by L„m the random value s ^ 1 such that Z^ = (n, m) 
(the first and only hitting time for the point {n,m), which is not necessarily finite), the complete log-likelihood is 

logPe(Xl:„, Yl-jn, inm,£l:L„„) 

(see Appendix for details) and for a current parameter value 9' , its conditional expectation writes 

Qg,{e) :=Ee/(logPe(Xi:„,Fi.„„L„™,£i.i„„)|Xi.„,Yi:„). 

Now, the idea of these stochastic approximations is to replace the computation of Qqi (9) by an approximation obtained from 
the simulation (under parameter value 6') of a number of hidden paths {ei:s} in £n,m,, with possibly different lengths. These 
paths satisfy the property that for each obtained value of s, the paths with length s have the same occurrence probability 
as Pe' (e|Xi:„, Yi:m, inm = s). Notc that the length of each simulated path is random. Then the maximization with respect 
to 6 of the complete log-likelihood is done on the basis of the simulated alignments. As a consequence, this maximization is 
performed via a simple counting of specific events and does not require numerical procedures. 
In particular, iteration r of sem algorithm writes 

- Simulation step: generate one hidden path e^ of some random length s with the same distribution as 

^g{r-l){£l:s\Xl:„,Yl:rm Lnm = s). 

- Maximization step: 6'^'"' = argmax^gQ logPe(Xi:„, Y-i-m, e = e^). 
And iteration r of saem algorithm writes 

- Simulation step: generate m{r) hidden paths e^{j), j — 1, ■ ■ ■ ,m{r) with resulting random lengths Srj, each one having the 
same distribution as 

'f°e(''-l) (Sl-Sr.j l^l:n) Yl;m, Lnm = Srj). 

- Stochastic approximation step: update 

m(r) 

Q.(e) = Q,-i(e)+7r(^ VlogPe(Xi:„,yi:,„,e = e''(j))-Q,_i(e)), 

where {7r}r^i is a decreasing sequence of positive step size and Qo{d) = l/?7i(0) 5]]"Li logPe(^i:n, yL:m,£ = eP{j)) . 

- Maximization step: 6^^' — argmax^gQ (5^(6*). 

The simulation step is common to both algorithms and consists in drawing paths of some unknown length s with same 
distribution as Pg(r-i) (ei:s|-^i:n, Viim, inm = s). This is performed via the backwards sampling based on the forward 
probabilities as we now explain. We may write 

'P'e('--l) (£l:si^l:n, Yl-m, Lnm ~ s) = Pg(r-l) (es|^l;n, Yl-m, Lnm ~ s) _[_[ Pg(r-l) (efe|efc + l:s, Xl-.n, Yl:m, Lnm = s) 

fe = l 

s-1 
= Pg{^~l){es\Xl:n,Yl:m, Zs = {n,m)) Y[P gi^~l} {£k\ek+l , Zk + 1 , Xl-.N^ + i ,Yl:M^^^) . (5) 



The last equality comes from the following: Conditional on £fc+i, the random variable et does not anymore depend on the 
observations after time k + 2. If we moreover condition on the variables Sk+i-.s and Lnm ~ s, then we know which point 
(i, j) on the lattice corresponds to Z^.+i ~ (n, m) — X]t=fe+2 ^* ~ (*' ■?)• T^^^ values of the observed sequence after time fc + 2 
correspond exactly to Xi+i-.n, Yj+i-.m- Then conditional on {ek+i:s,Xi;n, Yi-.m, Lnm ~ s}, the variable £k only depends on Sk+i 
(according to the Markov property) and on the values Zk+i,X-i_;Nk+i,Yi:M^^-^. 

So, given Sk+i, the position Zk+i = {i,j) on the lattice and the observed values Xi;i,Yi-j, we want to sample Sk from 
Fg{r-i){£k\sk+i,Xi:i,Yi:j,Zk+i = (J,j)), for k — s,s — f,...,l. Up to a renormalizing constant, each step corresponds to 
sampling from ¥g(r-i){ek,£k+i\Xi:i, Yi-.j, Zk+i ~ {i,j)) or equivalently from 7r£^,e^^ja^''((z, j) — e^+i). Then the backwards 
sampling writes 

- e^ is sampled to be v with probability 

a^{n, m) 



a^{n, m) + a^^ {n, m) + a^^ (n, m) ' 
For k < s and Zk+i ~ {hj), * > 1,.? > 1, if ejl^i = M, then e^ is sampled to be M with probability 

a"' {i ~ l,j - l)-KMMhiX„Yj\X,-:,,Yj^i) 



(6) 



(7) 
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and to be u = Ix or ly with probability 

a''ii-l,j-l)-KuMh(Xi,Yj) 
aM(j,j) 

Note that these latter probabilities are correctly renormalized according to Equation Q. 

- For k < s and Zk+i ~ (J, j), i > 1, j > 1, if ejl^j = Ix then e]. is sampled to be it with probability 

- For k < s and Z^+i ~ {i,j), * > 1, J > l, if ^k+i = Iy then e^ is sampled to be u with probability 

Q"(z,j-lKiyg(yj) 

- For k < s and Z^+i = (1, j), j > 1, if e^+i ~ ^^ or ^x then all the values ej", 1 ^ Z ^ fc are fixed to Jy, otherwise ej. is 
sampled to be u with probability given by ((T)). 

- For k < s and Z^+i = (i, 1), i > 1, if e^_|_i = M ox Iy then all the values ej", 1 ^ / ^ fc are fixed to Ix, otherwise ej^ is 
sampled to be u with probability given by ((G)). 

According to Equation ((SJ, we thus have sampled paths e"" satisfying the following property. 

Proposition 1: The paths e € £n,m sampled from the above scheme with parameter value 9 occur with probability 

Pe(e = e|Xi:„, Yi:m,Ln„i = |e|). 

We now consider the complete log-likelihood obtained from the observed sequences and one simulated path. We have 

logP9(Xi;„, Yi:m,e = e'') = ^ le5-=„log7r„ + ^ ^ le^_j=u,<;;;=i, log7ru„ 

|e'-| |e'-| 

+ ^^[^^l=ix,x^^=aAogf{a) + le-=/y,y„^=„ log,g(a)] + X] 5Z '^^l=M,x„^=a,Y^^=b\ogh{a,h) 

+ Y. Y. H--l-,-^''i^^u .^". .^".-1 ,y,„,_, )=(a,i,,c,d) log /i(a, 6|c, d), 

where we recall that (n^, mk) = X]i=i ^I (for simplicity, we omit the subscript r). Then in the maximization step of sem, the 
components of 6 are updated to the values: Vu, v £ £, a,b,c,d £ A, 

{,.) ^ JVe-- (it) (r) ^ A^eK^^") f (r) / N ^ iVer(a|Jx) (r) , ^ ^ Aer(a|Jy) 

h^^\a,b)= ^^^i-MM) feW(„,fe|,,rf)^ N..ia,bmi;c,d) ^^^ 

E Ne-{a',b'\M) E AeKa',b'|AfAf;c,d) 

a'.b'^A a',b'£A 

where we use obvious notations for the counts Ne in sequence e. Namely for any u,v £ £, and any a,b,c,d £ A, we let 
Ne{u) = Y,k l{s*-- = ■"}' Ne(u,v) = X] l{efc_i = u,ek = v}, Ne{a\lx) = I] l{efc = Ix,Xn^ = a} and similarly for Ae(a|/y), 
iVe(a,6|M) and also Ae(a, 6|MM; c, d) = E l{efc = Sk-i = A/, (Aat^, YMt, A]v^_i, VAffc^i) = (a,6,c,d)}. 

In the maximization step of saem we have to take into account the m{r) hidden paths and the stochastic approximation 
of Qr{9). Then the values of 6 are updated as in ([8]) by replacing the counts Ne^i^-) by their counterparts Nr{-), where 

Nr = iV._, + > (^ Erir N.'-U) ^.-i) ■ 

From a theoretical point of view, the convergence properties of the two algorithms are different. In saem, the sequence 
{0^}r^i converges, under general conditions, to a local maximum of the log-likelihood. It is important to note that the 
stochastic perturbation introduced with respect to the original em algorithm allows avoiding saddle points, which are possible 
attractive stationary points of the sequence generated by em IjDelvon et al.l . Il999l ). In sem, the sequence {9^}r^i does not 
converge pointwise, b ut it is an homogeneous Markov chain which is ergodic under appropriate conditions, see for instance 
iDiebolt andlpllllQQd ). 

In this context, maximum likelihood estimators are sensitive to overfitting if there is insufficient data. Indeed, if an event has 
never been observed in the sequ ences, the corr e spond ing estimator is not well defined, as both the numerator and denominator 
in jSl) are zero. Pseudo -counts (jDurbin et al.l . Il998l . Chapter 3) or other smoothing techniques such as the one proposed in 
iKneser and NevI (|l995h may be used to solve this problem. However, when using saem algorithm to perform maximum 
likelihood estimation, this problem is minimized and no smoothing strategies are required. Indeed, since at iteration r we 
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Table 1 

Parameter values in the two data sets. 

Data set 1 a = 0.4 13 = 0.2 -y = 0.06 A = 0.04 

Data set 2 a = 0.5 /3 = 0.15 7 = 0.05 A = 0.02 

generate m{r) hidden paths, the counts Nr (computed as an average over those ahgnments) are rarely equal to 0, even during 
the first iterations when 7r is typically set to 1. However, if we want to completely avoid this possibility, we may replace 
event counts by pseudo-counts in the case of unobserved events, during these first iterations. Note also that in practice we 
do not generate whole alignments at each iteration step, but only sub-alignments of a certain given length within the current 
alignment. This str ategy combines an MCM C procedure within saem and is known to preserve convergence properties of 
saem algorithm fsee lArribas-Gil et al.l . l2009l . for further details). In terms of event counts, this strategy implies that a current 
low estimate of an event occurrence probability will have some impact on next iteration estimate. Nonetheless, at the last 
iterations of the algorithm, when typically 7,. < 1, current counts are averaged with counts from previous iterations, which 
provides a smoothing procedure by itself. 

5. Simulations 

For these simulations, we consider nucleotide sequences with alphabet A = {A,C,G,T} and a constrained parameter set, 
motivated by the choice of an underlying evolutionary model on these sequences. As for the gene ration of the hidden Markov 
chain, we consider the TKF91 indel model with equal value for insertion and deletion rate A (see lArribas-Gill . 120071 . for more 
details) . The transition matrix of the hidden path is thus given by 

/ e-^ 1-e-^ A \ 

; r A 1 + A- 



1 + A 



h{x, y) = } 



1-e- 
\ e-^ \~e-^ A / 

where the order of the states in the matrix is M, Ix and Iy- Then, we describe the conditional distribution of the sequences. 
For any a- G ^, we set /(x) = g{x) = jix, where ^ is going to be the stationary distribution of the substitution process, 
according to the parametrization of h and h considered below. We set the distribution h according to a simple substitution 
process with equal substitution rate 7 and different nucleotides frequencies (modified Jukes Cantor). Let 

M^(l -e~'^)/iy ifs^/j/ 

Ma:[Mi^(l - e"^) + e"^] \ix = y. 

The parameter h accounting for the context in the substitution process differs from h only in a ^ match context. In this way, 
we want to take into account a possibly higher substitution rate from the pair CpG to CpA. More precisely, we let 

P(AV. = X, n., =y\e. = e... = M, A.._, = .', n.._, = y') = [ ^^ ^^^ 1^^ = ^' 

In aS match context, we use a model initially introduced in two slightly different forms bv lFelsensteinI ([198J) and lHasegawa et al.l 
(| 19851 ). This model considers different substitution rates for transitions (i.e. substitutions within the chemical class of purines 
TZ = {A, G} or pyrimidines y = {C, T}) and for transversions (substitutions modifying the chemical class) as well as different 
nucleotide frequencies /i. Denoting by x the other nucleotide in the same chemical class as x (namely x ^ x and either 
{x, x} £TZ or {x, x] £y), we get 

r M=.e-("+« + /i.e"''(l - e-")7i^ + m41 - 6"")^. if y = ^, 
hc{x,y)=l ^^e-''(l-e-")^^^+M4l-e~'')M^ if J/ = i, 

I ^j;(l — e^^)^y otherwise. 

We refer to iFelsensteinI ((20031) for more details on substitution processes. 

According to this parametrization we have simulated two sets of data with an alignment length of 2000 base pairs (bp) and 
parameter values described in Table [l] In these data sets, the substitution rate 7 is larger than the indel rate A as expected 
in biological sequences, and the transition and transversion substitution rates in a ^ match context (a and /3) are larger than 
the regular substitution rate 7. In both sets of data we have set the stationary distribution of the substitution process to 

^A = 0.225 ^JLc = 0.275 no = 0.275 ht = 0.225, 

in order to obtain a GC content (proportion of Gs or Cs) larger than 50%. 

For each one of the data sets we have conducted two estimation procedures. On the one hand, we have estimated the whole 
parameter {tt, f,g,h,h) via the saem algorithm as explained in Section [4.21 On the other hand, we have combined the saem 
algorithm with numerical optimization to estimate only the evolutionary parameters A, 7, a and /3. The first procedure 
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has the advantages of being robust against niisspecification of the underlying evolutionary model and relying on a explicit 
maximization step based on the counts of events. The second procedure is computationally more expensive, however, it is 
more parsimonious and it provides a straightforward evolutionary interpretation of the parameter values. In both cases we 
have performed 150 iterations of the saem algorithm with the parameter 7^ of the stochastic approximation set to 1 for 
r = 1, . . . , 100 and to l/(r — 100) for r = 101, . . . , 150. The number of simulated hidden paths is m{r) = 5, r = 1, . . . , 20 and 
m{r) = 10, r = 21, . . . , 150. We used the same initial values of the parameters for every simulation run over the two simulated 
data sets. These initial values, estimates and standard deviations are given in Tables (2] [3] (first procedure) and |4] (second 
procedure). In Figure [2] we present the distribution of estimates obtained with the second estimation approach for the two 
simulated data sets. We can observe that both procedures provide unbiased and very accurate estimates. In FigureOwe show 
the convergence of the saem estimates on a single simulation run from the second data set. We use a logarithmic scale for the 
X-axis to highlight the very fast convergence of the parameters to their true values. 



ql 



20 
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Figure 2. Distribution of saem estimates of the reduced parameter vector (q, /?, 7, A) over 100 simulation runs for the two 
simulated data sets (Data set 1 on top. Data set 2 on bottom). The real values of the parameters are displayed in dotted line. 
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Figure 3. saem estimates of the reduced parameter vector (a, /3, 7, A) over iterations in a single simulation run from the 
second data set. A logarithmic scale is used for the i-axis. The true parameter values are displayed in dotted line. 



6. Application to real data 

In this section we illustrate our method through an alignment of the human alpha-globin pseudoge ne (HBPAl) w ith its 
functional counterpart, the human alpha-globin gene (HBAl). This example is inspired by the work of lBulmen (| 19861 ). who 
studied the neighbouring base effects on substitution rates in a series of vertebrate pseudogenes, including the human alpha- 
globin, concluding that there is an increase in the frequency of substitutions from the dinucleotide CpG. 
We have extracted the sequences of the gene and the pseudog ene, who are located in the human alpha-globin gene cluster 
on chromosome 16, from the UCSC Genome Browser database (jFuiita et al.l . l2010l ). We have considered the whole sequences 
(coding and non-coding sequence in the case of the functional gene) for the alignment, obtaining sequences lengths of 842 bp 
for HBAl and 812 bp for HBPAl. Because of the presence of introns and exons in the HBAl sequence, it is natural to consider 
a model allowing for two different substitution behaviors along the sequence. That is why we have used a model for which 
the state space of the hidden Markov chain is {Mi, M2, Ix,Iy}, where Mi, i = 1,2, stand for two different match states, with 
different associated substitution processes hi, hi, i — 1,2. This kind of model may also handle fragments insertion and deletion 
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Table 2 

Mean values and standard deviations of estimates over 100 simulation runs for Data Set 1. The estimation has been 
performed without taking into account the specific parametrization used to generate the data. 



Initial value True value Estimate (sd) 



ttma/ 


0.85 


0.9238 


0.9184 


0.0101) 


T^MIx 


0.075 


0.0377 


0.0407 


0.0074) 


T^MIy 


0.075 


0.0385 


0.0410 


0.0077) 


T^Ix M 


0.85 


0.9424 


0.8807 


0.0998) 


'^Ixlx 


0.075 


0.0385 


0.0387 


0.0308) 


T^IxIy 


0.075 


0.0191 


0.0805 


0.1034) 


TTlyM 


0.85 


0.9238 


0.8788 


0.0982) 


niYlx 


0.075 


0.0377 


0.0821 


0.0987) 


t^IyIy 


0.075 


0.0385 


0.0391 


0.0305) 


hiA,A) 


0.0625 


0.2148 


0.2155 


0.0109) 


hiA,C) 


0.0625 


0.0036 


0.0031 


0.0019) 


h{A,G) 


0.0625 


0.0036 


0.0032 


0.0019) 


h{A,T) 


0.0625 


0.0029 


0.0027 


0.0018) 


h[C,A) 


0.0625 


0.0036 


0.0034 


0.0021) 


h{C, C) 


0.0625 


0.2634 


0.2628 


0.0130) 


h{C,G) 


0.0625 


0.0044 


0.0044 


0.0024) 


h{C,T) 


0.0625 


0.0036 


0.0034 


0.0021) 


h{G,A) 


0.0625 


0.0036 


0.0030 


0.0017) 


h{G,C) 


0.0625 


0.0044 


0.0041 


0.0023) 


h{G,G) 


0.0625 


0.2634 


0.2653 


0.0120) 


KG,T) 


0.0625 


0.0036 


0.0032 


0.0018) 


h{T, A) 


0.0625 


0.0029 


0.0026 


0.0016) 


h[T, C) 


0.0625 


0.0036 


0.0031 


0.0017) 


h{T, G) 


0.0625 


0.0036 


0.0035 


0.0021) 


h{T, T) 


0.0625 


0.2148 


0.2166 


0.0113) 


hc{A,A) 


0.0625 


0.1600 


0.1626 


0.0169) 


hc{A,C) 


0.0625 


0.0112 


0.0108 


0.0053) 


hc{A,G) 


0.0625 


0.0446 


0.0454 


0.0111) 


hc{A,T) 


0.0625 


0.0092 


0.0091 


0.0051) 


hc{C,A) 


0.0625 


0.0112 


0.0102 


0.0056) 


hc{C,C) 


0.0625 


0.2055 


0.2072 


0.0193) 


hc{C,G) 


0.0625 


0.0137 


0.0126 


0.0056) 


hc{C,T) 


0.0625 


0.0446 


0.0429 


0.0108) 


hc{G,A) 


0.0625 


0.0446 


0.0440 


0.0098) 


hc{G,C) 


0.0625 


0.0137 


0.0140 


0.0061) 


hc{G,G) 


0.0625 


0.2055 


0.2050 


0.0174) 


hc{G,T) 


0.0625 


0.0112 


0.0110 


0.0053) 


hc{T,A) 


0.0625 


0.0092 


0.0086 


0.0047) 


hc{T,C) 


0.0625 


0.0446 


0.0456 


0.0106) 


hc{T,G) 


0.0625 


0.0112 


0.0111 


0.0062) 


hc{T,T) 


0.0625 


0.1600 


0.1598 


0.0197) 



in contrast to single nucleotide indels (see lArribas-Gil et al.l . l2009l . for more details). As in the simulation studies of the previous 
section, h is the substitution matrix in a ^ match context, and h is the substitution matrix in any other case. In this way, 
we want to take into account a possibly higher transition rate from nucleotide G occurring in the dinucleotide CpG. In order 
to avoid any possible misspecification of the underlying indel and substitution processes, we have decided not to parametrize 
any of those, conducting an estimation approach equivalent to the first procedure described in the previous section. The only 
difference here lies in the dimension of the model parameter, since this parameter is now (vr, /, g,hi,h2,hi,h2), where tt is a 
4x4 transition probability matrix, and hi, hi, i — 1,2 are four different 4x4 stochastic vectors, yielding a total number of 
12 + 3 + 3 + 4 X 15 = 78 free parameters. 
We have run saem algorithm on the sequences, performing 500 iterations with 7r set to 1 for r = 1, . . . , 400 and to l/(r — 400) 
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Table 3 

Mean values and standard deviations of estimates over 100 simulation runs for Data set 2. The estimation has been 
performed without taking into account the specific parametrization used to generate the data. 



Initial value True value Estimate (sd) 



TTa/a/ 


0.85 


0.9610 


0.9559 


0.0071) 


TTm/x 


0.075 


0.0194 


0.0221 


0.0051) 


T^MIy 


0.075 


0.0196 


0.0219 


0.0055) 


T^Ix M 


0.85 


0.9706 


0.8477 


0.1262) 


'^Ixlx 


0.075 


0.0196 


0.0199 


0.0221) 


T^IxIy 


0.075 


0.0098 


0.1323 


0.1289) 


TTlyM 


0.85 


0.9610 


0.8407 


0.1321) 


niYlx 


0.075 


0.0194 


0.1367 


0.1379) 


t^IyIy 


0.075 


0.0196 


0.0226 


0.0252) 


hiA,A) 


0.0625 


0.2165 


0.2168 


0.0114) 


hiA,C) 


0.0625 


0.0030 


0.0027 


0.0015) 


h{A,G) 


0.0625 


0.0030 


0.0027 


0.0014) 


h{A,T) 


0.0625 


0.0025 


0.0024 


0.0015) 


h[C,A) 


0.0625 


0.0030 


0.0026 


0.0017) 


h{C, C) 


0.0625 


0.2653 


0.2673 


0.0122) 


h{C,G) 


0.0625 


0.0037 


0.0033 


0.0020) 


h{C,T) 


0.0625 


0.0030 


0.0027 


0.0016) 


h{G,A) 


0.0625 


0.0030 


0.0023 


0.0015) 


h{G,C) 


0.0625 


0.0037 


0.0033 


0.0020) 


h{G,G) 


0.0625 


0.2653 


0.2643 


0.0096) 


KG,T) 


0.0625 


0.0030 


0.0023 


0.0013) 


h{T, A) 


0.0625 


0.0025 


0.0024 


0.0015) 


h[T, C) 


0.0625 


0.0030 


0.0025 


0.0016) 


h{T, G) 


0.0625 


0.0030 


0.0027 


0.0015) 


h{T, T) 


0.0625 


0.2165 


0.2196 


0.0102) 


hc{A,A) 


0.0625 


0.1588 


0.1595 


0.0159) 


hc{A,C) 


0.0625 


0.0086 


0.0069 


0.0038) 


hc{A,G) 


0.0625 


0.0505 


0.0504 


0.0096) 


hc{A,T) 


0.0625 


0.0071 


0.0067 


0.0042) 


hc{C,A) 


0.0625 


0.0086 


0.0087 


0.0045) 


hc{C,C) 


0.0625 


0.2053 


0.2053 


0.0199) 


hc{C,G) 


0.0625 


0.0105 


0.0103 


0.0046) 


hc{C,T) 


0.0625 


0.0505 


0.0504 


0.0101) 


hc{G,A) 


0.0625 


0.0505 


0.0515 


0.0102) 


hc{G,C) 


0.0625 


0.0105 


0.0097 


0.0052) 


hc{G,G) 


0.0625 


0.2053 


0.2036 


0.0191) 


hc{G,T) 


0.0625 


0.0086 


0.0078 


0.0043) 


hc{T,A) 


0.0625 


0.0071 


0.0078 


0.0041) 


hc{T,C) 


0.0625 


0.0505 


0.0515 


0.0098) 


hc{T,G) 


0.0625 


0.0086 


0.0084 


0.0040) 


hc{T,T) 


0.0625 


0.1588 


0.1613 


0.0166) 



for r = 401, . . . , 500. The number of simulated hidden paths is m{r) = 5, r — 1, . . . , 20 and m{r) = 10, r — 21, . . . , 500. 
In Table [5] we present the estimated substitution probabilities from nucleotide G. As expected, we found an increase in the 
substitution occurrence for G in a ^ match context. We present in Figure U the posterior probability distribution over the 
set of possible alignment columns, match, insertion and deletion (the two match states have been merged) for every pair of 
aligned positions in a consensus alignment obtained at the convergence stage (last iterations) of saem algorithm. 

In order to compare our method to a traditional pair-HMM alignment, we have also run saem algorithm, with the same 
settings as before, on a simpler model without substitution context-dependence. In this model we consider again two different 
substitution regimes, so the parameter vector is now (tt, /, g, hi, /12), yielding a total number of 12 + 3 + 3 + 2 x 15 = 48 free 
parameters. We present in Figure [S] the posterior probability distribution over the set of possible alignment columns, match. 
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Table 4 

Mean values and standard deviations of estimates over 100 simulation runs for the two data sets. The estimation has been 
performed by numerical optimization on the reduced parameter vector (a,/3,7, A). 

Data set 1 





Initial value 


True value 


Estimate (sd) 


a 

13 

7 
A 


0.8 
0.25 

0.1 
0.08 


0.4 
0.2 
0.06 
0.04 


0.400 (0.0750) 
0.1985 (0.0372) 
0.0602 (0.0087) 
0.0402 (0.0040) 


Data set 2 




Initial value 


True value 


Estimate (sd) 


a 

7 
A 


0.8 
0.25 

0.1 
0.08 


0.5 
0.15 
0.05 
0.02 


0.5091 (0.0772) 
0.1493 (0.0296) 
0.0499 (0.0079) 
0.0196 (0.0024) 



Table 5 

Relative substitution probabilities for G, namely cj>i{G, •)/ J^^^g^ 4'i{G, a), i = 1,2, <j) = h,h. The probability of transition for 

G increases from 0.08 m the general context to 0.88 in the c m,atch context in the first substitution regime, and from 0.02 to 

0.15 in the second. 





A 


C 


G 


T 


hi 


0.0839 


0.0759 


0.8398 


0.0004 


hi 


0.8835 


0.0093 


0.0951 


0.0121 


h2 


0.0242 


0.0596 


0.8681 


0.0481 


h2 


0.1508 


0.1420 


0.7058 


0.0014 



insertion or deletion (the two match states have been merged) for every pair of aligned positions in a consensus alignment 
obtained at the convergence stage (last iterations) of saem algorithm. This posterior probability is quite similar to the one 
obtained with the context-dependent model. 

In order to further compare the two models, we have calculated a BIC-type criterion penalizing the maximum likelihood by 
the number of parameters. Specifically, we have computed BIC = — 2L + fclog(n), where L is the value of the log-likelihood 
at 8ml, n is the sample size and k is the total number of free parameters. We have taken n — 842, the length of the longest 
sequence, since the length of the alignment is unknown, but is equivalent to those of the observed sequences. Indexing by 1 
the context-dependent model and by 2 the non context-dependent one, we get 

BICi = -2 ■ (-7128.50) + 781og(842) = 14782.39 

BIC2 = -2 • (-7337.40) + 481og(842) = 14998.12, 

which means that the context-dependent model better fits this data set. 



7. Discussion 

We have proposed a model for statistical sequence alignment accounting for context-dependence in the substitution process. 
This model extends the pair hidden Markov model in the sense that, conditional on the alignment, the presence of particular 
nucleotides at the different sites of a sequence are no longer independent. We have adapted the traditional dynamic programing 
algorithms to the new framework, and we have proposed an efhcient estimation method based on saem algorithm. We have 
obtained asymptotic results for maximum likelihood estimation on the new model, and through a simulation study, we have 
shown the accuracy of our algorithms on finite sample estimation. Finally, we have illustrated our methods with the alignment 
of the human alpha- globin pseudogene and its functional counterpart. We have compared the new model with a classical pair- 
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Figure 4. Posterior probability distribution of the alignment states at the maximum likelihood parameter estimate with 
the substitution context-dependent model. 
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Figure 5. Posterior probability distribution of the alignment states at the maximum likelihood parameter estimate without 
taking into account possible substitution context-dependence. 

HMM through a model selection criterion, concluding that taking into account the context-dependence on the substitution 
process improves the fitting of the data. 
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Appendix 

The need for a stochastic approximation 

Let us explain why em algorithm may not be applied in pair-HMM. Denoting by Lnm the random value s ^ 1 such that 

Zs — (n,m) (the first and only hitting time for the point (n,m), which is not necessarily finite), the complete log-likelihood 

writes 

n-\-m s 

s — nVm uG£ k — 2u,v££'^ 

s s 

+ J2J2^^^''=ix,Xt,^=a'\-Ogf{a) + 1,^=Iy,Yj,i^^, log 3(a)] + ^ ^ l^^^,^,_^^_^^M,x^^=a,YM^^blogh{a,b) 

k=laeA k = la,b£A^ 

s 
+ Y1 Yl '^^k=^k-l=M,iXN^:,YM^^,Xi^^^^,YM^_,) = (a,b,c,d)^Ogh{a,b\c,d)j. 

k^2 a,b,c,d£A'^ 

To simplify notations, we let X := Xi:„,Y :— Y\:m,L :— Lnm and J^s = {X,Y,Z/ = s}. Moreover, we let {e,X,Y)k ■ — 
{ek,XNf. , Ym^,). Taking the expectation of the complete log-likelihood, conditional on the observed sequences, under a current 
parameter value 9' , leads to 

n + TTi 

Ee,(logPe(X,Y,L,£i.i)iX,Y)= Y. lP9'(i = s|X,Y){^Pe'(ei=«iJ^.)log7r„ 

s 

+ ^ ^ fe'ii£k-i,ek) ^ iu,v)\Ts)logTruv 

s 

+ EE [Pe'iie, X)k = {Ix,a)\Ts)\og f{a)+¥e,i{e,Y)k ^ {lY,a)\Ts)\ogg{a)] 

k—1 a£A 
s 

+ Y,Y.^e'{ek-i 7^ M,(e,X,Y)k ^ {M,a,b)\Ts)\ogh{a,b) 

k — 1 a.b 

s 

+ Y.Y. ^9'i{ek,ek-i, (X, Y)k, (X, Y)k-.i) = (M, M, a, b, c, d)\J^s) log h{a, b\c, d)}. 

fc — 2 a,b.c,d 

There are two main issues at stake here. First, the quantity fe'{Lnm\Xi;n,Yi:m,) is not given by the forward-backward 
equations. Second, computing this conditional expectation would also require the knowledge of the quantities 

Pe.(efc_i,efe|X,Y,L = s). 

However, the forward backward equations rather give access to quantities as for instance 

a'^it -l,j- l)ni^Mh{X,,Y,)p"(t,j) = P9(efc_i = Ix,ek = A/|X, Y,^fe = {i,j)). 

In other words, the conditional distribution of {ek-i,£k) is known only conditional on the extra knowledge of the position Zk 
of the path in the lattice at time k. In conclusion, in pair-HMM, the forward-backward equations are not sufHcient to compute 
the expectation of the complete log-likelihood, conditional on the observed sequences. 



